Abstract

Real applications of natural language document processing are very often confronted with domain specific lexical gaps during the analysis 
of documents of a new domain. This paper describes an approach for the derivation of domain specific concepts for the extension of 
an existing ontology. As resources, we need an initial ontology and a partially processed corpus of a domain. We exploit the specific 
characteristics of the sublanguage in the corpus. Our approach is based on syntactic structures (noun phrases) and compound analysis to 
extract information required for the extension of GermaNet's lexical resources. 



1. Introduction 

One of the bottlenecks in real applications of natural language
document processing is the coverage of domain-specific lexical
resources. In experiments with the document suite XDOC [1], we are
currently processing documents about casting technology, company
profiles from web pages, and autopsy protocols. Many of the tools have
an extensive need for linguistic resources. Therefore, we are
interested in ways to exploit existing resources with a minimum of
extra work. The resources of GermaNet promise to be helpful for
different tasks in the workbench.

In this paper, we will outline how the resources of GermaNet can be
extended. Our methods exploit the specific characteristics of the
documents in the corpus. We combine different approaches to extract new
concepts from the corpus. The idea behind our approach is to generalise
from structures with known GermaNet entries to structures without
GermaNet entries.

This paper presents only experiments with GermaNet on German texts, but
the approach can also be applied to WordNet when processing domain
specific English texts.

The paper is organized as follows: The next section briefly outlines
the test corpus and the integration of GermaNet in XDOC. Section 3
describes the methods for the extraction of new concepts and the
results. We conclude the paper with a discussion section.

2. Document Processing with XDOC 

2.1. Characteristics of the Corpus 

In the following description of the approach, a corpus of forensic
autopsy protocols is used, because these documents are especially
amenable to processing with techniques from computational linguistics
and knowledge representation.

Autopsy protocols consist of the following major document parts:
findings, histological findings, background, discussion, conclusions,
etc. Our analyses focus on the sections findings, background and
discussion. In the findings section, a high ratio of nouns and
adjectives is encountered, and the sentences, which can also be
verbless, are mostly short. This section describes the medical findings
in common language. Here we find no domain specific (medical) terms.
The background and discussion sections contain a standard distribution
of all word classes and regular syntactic structures. The background
section describes, for example, the details of a traffic accident,
while the discussion section combines the results of the findings
section with the facts reported in the background section.

2.2. Integration of GermaNet 

The document suite XDOC contains methods for linguistic processing of
documents in German. The focus of the work has been to offer end users
a collection of highly interoperable and flexible tools for their
experiments with document collections.

XDOC consists of different modules, for example, the syntactic module
and the semantic module (for a more detailed description see (Rösner
and Kunze, 2002b)).

For the semantic analyses of a domain using XDOC, knowledge about the
domain - ideally a domain specific ontology - is needed. One possible
resource for the processing of autopsy protocols could be medical
thesauri like UMLS (Unified Medical Language System) [2]. Many of these
resources work with medical terminology, but in the corpus of forensic
autopsy protocols only everyday terms are used. Thus a resource that
contains everyday terms and concepts (and their relations) from the
medical domain is required for the analysis. GermaNet (see (Hamp and
Feldweg, 1997), (Kunze, 2001)) is intended as a model of the German
base vocabulary.

However, specific terms in some particular domains, like the medical
domain, are covered only partially in GermaNet.

For the semantic analysis in XDOC, the synonymy and the hypernymy
relations of GermaNet are used. We found the following coverage of
GermaNet's resources for terms in the corpus: 31 % for the findings
section, 44 % for the background section, and 42 % for the discussion
section (see also (Kunze and Rösner, 2003)). The reason for the poor
result of the findings section is the highly frequent occurrence of
medical concepts denoted by noun compounds like 'Nierengewebe' (kidney
tissue) or 'Halswirbelsäule' (cervical spine) that are not covered by
GermaNet, whereas the individual components of the compounds, like
kidney and spine, have lexical entries in GermaNet.

[1] XDOC stands for XML based document processing.

[2] http://www.nlm.nih.gov/research/umls/umlsmain.html

In the next section, we will describe how new entries 
can be derived from entries that exist in GermaNet. We start 
with a corpus of autopsy protocols parsed syntactically by 
XDOC and with GermaNet as an initial ontology. 

3. Methods for the Deduction of Word Senses

In (Rösner and Kunze, 2002a), we outlined some ideas for the
exploitation of sublanguage characteristics of a corpus for lexicon
creation. In this paper, we will further elaborate these ideas. This
section presents how the syntactic structures of the corpus sublanguage
can be useful for the extraction of new GermaNet entries.

3.1. Fundamental Idea of the Approach 

In the findings section of the documents, high-frequency 
complex noun phrases can be exploited for the extension of 
the GermaNet resources. 

The grammar fragment used in XDOC for this corpus covers the following
complex noun phrases (in all cases, the first NP is a simple noun
phrase):

• NP NP_genitive,

• NP NP_genitive *PP and

• NP *PP.

Our experiments are based on the interpretation of complex noun phrases
that are described by the syntactic structure NP -> NP NP_genitive
(i.e. a simple NP modified by a genitive attribute).

In the case of a complex noun phrase, several possibilities for a
semantic interpretation of this syntactic structure exist, for example,
a part-of relation in 'dermis of the hand' or a patient-of relation in
'the production of cars'.



[Figure 1: A Sketch of the Idea. In the corpus, the structure 'keyword
of complement' occurs as 'fracture of <known>' and as 'fracture of
<unknown>'. GermaNet provides the class of <known>, from which the
class of <unknown> is deduced.]



The idea behind the approach is based on the following assumptions. A
structure of the form KEYWORD of COMPLEMENT describes the same relation
for every possible candidate of the complement, e.g., part-of. A
further assumption is that the complement candidates of a keyword have
the same semantic category. The information about complement candidates
available in GermaNet is used to deduce information about the semantic
category of candidates that are unknown in GermaNet (see also Fig. 1).

Table 1: Some Complements of a Structure Beginning with Keyword 'Bruch'
(Fracture).

complement             occurrences   top level of GermaNet
[...]                  254           nomen.koerper
Brustbein              65            nomen.koerper
Wirbelsäule            58            nomen.koerper
Schädeldach            43            -
Oberschenkelknochen    37            -
Schädelbasis           34            -
Schlüsselbein          33            -
Schambein              30            nomen.koerper
Brustwirbelsäule       28            -
Halswirbelsäule        26            -
Schulterblatt          23            nomen.koerper

3.2. Exploiting Syntactic Structures of the Corpus 

In the corpus (600 autopsy protocols with more than 1.5 million word
forms), structures of the form NP -> NP NP_genitive are often
encountered. For example, the phrase 'Schleimhaut des Magens' (mucosa
of the stomach) occurs 317 times in the corpus. The more generalised
phrase 'mucosa of XXX' occurs 836 times in the corpus. Another
generalised example is the phrase 'fracture of XXX', which occurs 749
times in 93 different forms. One example is the class of NPs with the
keyword 'Bruch' (fracture) modified by a complement (the second noun
phrase in the structure), e.g., 'Wirbelsäule' (spine) in the phrase
'Bruch der Wirbelsäule' (58 occurrences) or 'Wadenbein' (fibula) in the
phrase 'Bruch des Wadenbeines' (11 occurrences). Other complements for
the keyword 'fracture' found in the corpus are: 'Elle' (ulna),
'Oberarmknochen' (humerus), 'Schädelgrund' (base of the skull),
'Schienbein' (shinbone), 'Unterkiefer' (lower jaw), 'Unterarmknochen'
(forearm bone), etc.

At first, structures with high occurrence frequencies in the corpus are
selected. For this task, the findings sections of the documents are
parsed with the syntactic parser of XDOC. A domain specific grammar
with ca. 40 rules is used. In the results of 18008 parsed sentences,
2808 complex noun phrases (NP -> NP NP_genitive) with 1069 different
keywords are encountered.

The most frequent keywords in such structures are: 'Abgang' (outlet),
'Bauchteil' (abdominal part), 'Brustteil' (chest part), 'Blutreichtum'
(hyperemia), 'Fäulnis' (sepsis), 'Haut' (dermis), 'Schleimhaut'
(mucosa), 'Gegend' (region), 'Schnittflächen' (cut surfaces),
'Unterblutung' (hematoma), and 'Bruch' (fracture).

The next step is to use regular expressions to get all occurrences of a
particular combination of a keyword and a complement, because not all
occurrences in the corpus can be obtained with the chart parser. The
reason for this is that there are gaps in the grammar (when parsing the
background and discussion sections) and gaps in the morphological
lexicon.

The most frequent keywords are used in regular expressions to get all
phrases that begin with the keyword. The length of these phrases (text
window size) is restricted to 3 tokens (or 4 tokens, when adjectives in
the complement noun phrase are allowed).
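
The regular-expression step can be sketched as follows. This is only an illustration under stated assumptions, not the pattern used in XDOC: the function name find_complements, the restriction to the genitive articles 'der' and 'des', and the sample sentence are hypothetical. The pattern covers the 3-token window (keyword, article, noun) and the 4-token window with one optional adjective; inflected forms such as 'Wadenbeines' are returned as found, without lemmatisation.

```python
import re
from collections import Counter

# Assumption: genitive complements are introduced by 'der' or 'des'.
GENITIVE_ARTICLES = r"(?:der|des)"

def find_complements(text, keyword):
    """Count the nouns that follow `keyword` + genitive article (+ adjective)."""
    pattern = re.compile(
        rf"\b{re.escape(keyword)}\s+{GENITIVE_ARTICLES}\s+"  # e.g. 'Bruch der'
        r"(?:[a-zäöüß]+\s+)?"                                # optional adjective
        r"([A-ZÄÖÜ][A-Za-zÄÖÜäöüß]+)"                        # capitalised noun
    )
    return Counter(match.group(1) for match in pattern.finditer(text))

# Hypothetical sample text in the style of the findings section.
sample = ("Bruch der Wirbelsäule. Bruch des Wadenbeines. "
          "Bruch der Wirbelsäule und Bruch des linken Schlüsselbeines.")
counts = find_complements(sample, "Bruch")
for noun, n in counts.most_common():
    print(noun, n)
```

In a real setting, the surface forms would still have to be mapped to their base forms before the GermaNet lookup.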



For each structure, the GermaNet interface is used to check if
information about the keyword of the complement NP is available. For
the example (keyword: fracture), GermaNet contains 31 of the 93
complement elements found in our corpus. Most complement words of a
keyword that are found in GermaNet have the same top level category;
only a small number of words have more than one reading. For the
example, the following top level categories (each given with its
percentage of all senses) are encountered: <nomen.Koerper>: 75 %,
<nomen.Artefakt>: 16.5 %, <nomen.Menge>: 5.5 %, and <nomen.Nahrung>:
3 %. All the words with more than one sense have at least one sense
with the top level category <nomen.Koerper>.

Table 1 presents a small excerpt of the complement words [3] in the
corpus for the keyword fracture. The main top level category for the
complement words is <nomen.Koerper> (WordNet category: noun.body).

The first assumption is that all complement words of a keyword in a
domain will belong to the same top level category in GermaNet. That
means that those words of the example which are not contained in
GermaNet, like 'Oberarmknochen' (humerus), 'Schädelbasis' (base of the
skull), 'Schädeldach' (calvarium), 'Brustwirbelsäule' (thoracic spine),
etc., can be assigned to the same top level category: <nomen.Koerper>.
In the case of the example (keyword fracture), this heuristic yields
the correct top level category for 93.44 % of all complements.
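
A minimal sketch of this heuristic is given below. The dictionary TOP_LEVEL is a hand-made stand-in for GermaNet lookups and its entries are illustrative assumptions (in particular the second reading of 'Elle'); the function names are hypothetical as well. The most frequent top level category among the known complements is assigned to every complement without an entry.

```python
from collections import Counter

# Assumption: toy lookup table, complement -> top level categories of its senses.
TOP_LEVEL = {
    "Brustbein": ["nomen.Koerper"],
    "Wirbelsäule": ["nomen.Koerper"],
    "Schambein": ["nomen.Koerper"],
    "Elle": ["nomen.Koerper", "nomen.Menge"],  # word with two readings
}

def preferred_category(complements):
    """Return the most frequent top level category over all known senses."""
    votes = Counter(cat for c in complements for cat in TOP_LEVEL.get(c, []))
    return votes.most_common(1)[0][0] if votes else None

def assign_unknown(complements):
    """Assign the preferred category to complements missing from the table."""
    category = preferred_category(complements)
    return {c: category for c in complements if c not in TOP_LEVEL}

complements = ["Brustbein", "Wirbelsäule", "Schambein", "Elle",
               "Schädelbasis", "Oberarmknochen"]
assigned = assign_unknown(complements)
print(assigned)
```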

In the next step, subclasses of the GermaNet top level category are
used, so that a word can be annotated with additional information,
e.g., its hypernymy relation. For this task, GermaNet's hypernymy
relation is exploited. The hypernym information is selected for all
complements that exist in GermaNet. The hypernymy relation in GermaNet
can contain more than one level of hypernyms for an entry.

At first, all senses with their hypernym information are selected. Each
sense and its hypernyms describe a class path, and each entry in this
class path names a semantic class. The occurrences of the different
semantic classes for all senses (class paths) are counted. For the
different forms of the phrase 'Bruch der/des XXX' (in English: fracture
of XXX), 36 senses with altogether 63 different semantic classes are
encountered. Table 2 presents a partial list of the semantic classes
and their numbers of occurrences in all the senses for the complement
elements covered by GermaNet. For example, the semantic class 'Knochen'
(bone) appears in 13 senses as a hypernym, the semantic class
'Computerprogramm' (software) in only one sense.
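
The counting step can be illustrated with a short sketch. The class paths below are invented for illustration and are not GermaNet output; the function name count_classes is hypothetical. The percentage is assumed to be the share of a class among all counted class occurrences, which is consistent with the values reported in Table 2 (e.g. 22 of 160 occurrences = 13.75 %).

```python
from collections import Counter

# Assumption: illustrative class paths, one list per sense.
CLASS_PATHS = [
    ["Armknochen", "Knochen", "Hornsubstanz", "Körpersubstanz", "Stoff", "Objekt"],
    ["Knochen", "Hornsubstanz", "Körpersubstanz", "Stoff", "Objekt"],
    ["Längeneinheit", "Masseinheit", "Menge"],
]

def count_classes(paths):
    """Map each semantic class to (occurrences, percentage of all occurrences)."""
    counts = Counter(cls for path in paths for cls in path)
    total = sum(counts.values())
    return {cls: (n, 100.0 * n / total) for cls, n in counts.most_common()}

table = count_classes(CLASS_PATHS)
for cls, (n, pct) in table.items():
    print(f"{cls:<16}{n:>3}{pct:>8.2f}")
```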

At this point, we do not have a clear and unique result. The highly
frequent hypernym entries in all senses found in GermaNet are 'Objekt'
(object), 'Hornsubstanz' (akeratosis), 'Knochen' (bone), etc. These
results can be enhanced when we allow only senses that describe a
concept with the top level assignment <nomen.Koerper> (see Table 3).
The possible senses are then reduced to 27 senses with altogether 22
different semantic classes.

Table 2: Hypernym Information for Complement Entries.

hypernym                                                          number of     percentage
                                                                  occurrences
<nomen.Tops> -> Objekt                                            22            13.75
<nomen.Koerper> -> Hornsubstanz                                   13            8.125
<nomen.Substanz> -> Stoff1, Substanz, Materie                     13            8.125
<nomen.Koerper> -> Körpersubstanz                                 13            8.125
<nomen.Koerper> -> Knochen, Gebein                                13            8.125
<nomen.Artefakt> -> Artefakt, Werk                                7             4.375
<nomen.Tops> -> Ding, Sache, Gegenstand, Gebilde                  7             4.375
<nomen.Menge> -> Masseinheit, Mass, Messeinheit, Messeinheit*o    2             1.25
<nomen.Koerper> -> Armknochen                                     2             1.25
<nomen.Artefakt> -> Computerprogramm, Programm                    1             0.625
<nomen.Artefakt> -> ?akustisches Gerät                            1             0.625

[3] The complement words described in Table 1 occurred in the corpus in
a singular or plural form.

When the basic concepts (WordNet's 'unique beginners') of GermaNet,
e.g. Objekt, are ignored and the most specific of the highly frequent
hypernyms is selected, the following partial class path results:

<nomen.Koerper> -> Knochen, Gebein
<nomen.Koerper> -> Hornsubstanz
<nomen.Koerper> -> Körpersubstanz
<nomen.Substanz> -> Stoff, Substanz, Materie
<nomen.Tops> -> Objekt

For the selection of the most specific hypernym, every level in the
class path is assigned a weighting factor (the selection process is
described by Eq. 1). The unique beginner (in our example Objekt) starts
with the factor 0, the next level gets the factor 1, and so on.



$$\arg\max_{c_i} \; \frac{n(c_i)}{N} \cdot f_i \qquad (1)$$



For each semantic class $c_i$, the quotient (the number of occurrences
of the semantic class, $n(c_i)$, divided by the number of all semantic
classes, $N$) is multiplied by its weighting factor $f_i$ (see also
Fig. 2). In the result above, the semantic classes were assigned the
following factors: $f_{\text{Objekt}} = 0$, $f_{\text{Stoff}} = 1$,
$f_{\text{Körpersubstanz}} = 2$, $f_{\text{Hornsubstanz}} = 3$,
$f_{\text{Knochen}} = 4$.



[Figure 2: Weighting of Possible Semantic Classes. The figure shows the
class paths of complements such as Wadenbein and Schienbein. For the
preferred top level category <nomen.Koerper>, the weights are:
Armknochen (0.11), Knochen (0.61), Hornsubstanz (0.45), Körpersubstanz
(0.30), Stoff, Substanz, Materie (0.15), Objekt (0). The path under
<nomen.Menge> (?nicht definierte Längeneinheit, Längeneinheit,
?kognitives Objekt, Menge) belongs to a not preferred top level
category.]
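
The weights in Figure 2 can be recomputed from the counts in Table 3 with Eq. 1, as the following sketch shows. Two values are assumptions inferred from the reported numbers rather than stated in the text: the total N = 85 (implied by the percentages in Table 3, e.g. 13/85 = 15.29 %) and the factor 5 for Armknochen (one level below Knochen, matching the weight 0.11 in the figure).

```python
# Occurrences n(c_i) taken from Table 3.
COUNTS = {
    "Objekt": 14,
    "Stoff, Substanz, Materie": 13,
    "Körpersubstanz": 13,
    "Hornsubstanz": 13,
    "Knochen, Gebein": 13,
    "Armknochen": 2,
}

# Level factors f_i; the value for Armknochen is an inferred assumption.
FACTORS = {
    "Objekt": 0,
    "Stoff, Substanz, Materie": 1,
    "Körpersubstanz": 2,
    "Hornsubstanz": 3,
    "Knochen, Gebein": 4,
    "Armknochen": 5,
}

N = 85  # assumption: total implied by the percentages in Table 3

def weight(cls):
    """Weight of a semantic class according to Eq. 1: n(c_i) / N * f_i."""
    return COUNTS[cls] / N * FACTORS[cls]

best = max(COUNTS, key=weight)
for cls in COUNTS:
    print(f"{cls:<28}{weight(cls):.4f}")
print("selected:", best)
```

Under these assumptions, 'Knochen, Gebein' receives the highest weight (about 0.61), which agrees with the figure.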



The whole approach described above is sketched in the following (given
a keyword K and the set Cs of all complements of K):

procedure find-entry (K, Cs):

Step 1: for each complement c ∈ Cs: get all (GermaNet) senses of c -> Hs;

Step 2: ascertain the most frequent top level category in Hs -> T;

Step 3: remove senses from Hs which are not assigned the preferred
top level category T -> Hs_prefer;

Step 4: for each sense s ∈ Hs_prefer: collect all semantic classes of the
hypernym information of s -> SCs;

Step 5: for each semantic class sc ∈ SCs: calculate

Step 5.1: occurrences of sc (n(c_i)) / number of all sc (N) -> sc_ratio;
Step 5.2: sc_ratio times level in the hypernym tree (f_i) -> sc_weight;

Step 6: select sc with maximum sc_weight;
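
The procedure can be turned into a small runnable sketch. The sense inventory SENSES is a hypothetical stand-in for GermaNet (the paths and the second reading of 'Elle' are illustrative assumptions), and the level of a class is taken to be its position in the class path counted from the unique beginner.

```python
from collections import Counter

# Assumption: toy inventory, word -> list of (top level category, class path);
# class paths run from the unique beginner (level 0) to the most specific class.
BONE = ["Objekt", "Stoff", "Körpersubstanz", "Hornsubstanz", "Knochen"]
SENSES = {
    "Wadenbein": [("nomen.Koerper", BONE + ["Beinknochen"])],
    "Schienbein": [("nomen.Koerper", BONE + ["Beinknochen"])],
    "Elle": [("nomen.Koerper", BONE + ["Armknochen"]),
             ("nomen.Menge", ["Menge", "Längeneinheit"])],
}

def find_entry(complements):
    """Return (top level category, semantic class) for a complement set."""
    # Step 1: collect all known senses of all complements.
    senses = [s for c in complements for s in SENSES.get(c, [])]
    if not senses:
        return None
    # Step 2: most frequent top level category.
    top = Counter(cat for cat, _ in senses).most_common(1)[0][0]
    # Step 3: keep only the senses of the preferred category.
    preferred = [path for cat, path in senses if cat == top]
    # Step 4: count the semantic classes and remember their level.
    counts, level = Counter(), {}
    for path in preferred:
        for depth, cls in enumerate(path):
            counts[cls] += 1
            level[cls] = depth
    # Step 5: ratio times level factor.
    total = sum(counts.values())
    weights = {cls: counts[cls] / total * level[cls] for cls in counts}
    # Step 6: class with the maximum weight.
    return top, max(weights, key=weights.get)

result = find_entry(["Wadenbein", "Schienbein", "Elle", "Oberarmknochen"])
print(result)
```

The returned pair would then be proposed as the top level category and hypernym for the complements without an entry, here 'Oberarmknochen'.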

For ca. 80 % of the complement words of the keyword fracture this
assignment is correct. Erroneous assignments result from misspelled
tokens (e.g. Oberschenkelknorren instead of Oberschenkelknochen) or
erroneous fragments in the results of the preprocessing steps (e.g.,
the treatment of German truncations in phrases like Bruch des Ober-
und Unterarmes (fracture of upper arm and forearm)). Another type of
error occurring in the evaluation was the case in which the second noun
phrase can also be parsed as a complex noun phrase. For the example,
only 2 forms are encountered: Bruch der Anteile ... (fracture of parts
of ...) and Bruch der Wandung ... (fracture of septum of ...). For a
reliable evaluation of these results, it is necessary to consult the
domain specific knowledge of a medical expert. In some cases, it is not
clear to a non-expert whether a derived sense is correct. For instance,
the word 'Ellenbogengelenk' (elbow joint) describes a (complex) system
of bones, cartilages, connective tissues, etc.

3.3. Compound Analysis 

An alternative way is to group words according to their components. In
German, and especially in the corpus, a lot of compounds are found,
e.g., 'Armknochen' (arm bone), 'Oberarmknochen' (upper arm bone), and
'Unterarmknochen' (forearm bone). GermaNet contains the word
'Armknochen', but not the words 'Oberarmknochen' and 'Unterarmknochen'.
For this case, a list of typical prefixes of the domain can be used.
Prefixes in the domain are, e.g., 'Unter-', 'Ober-', 'Innen-',
'Aussen-', which form a list of antonym pairs. In this case, the
hypernym information can be used directly for the new entry. For
example, the following GermaNet entry is found for the word
Armknochen:

1 sense of armknochen

Sense 1 <nomen.Koerper> Armknochen
<nomen.Koerper> -> Knochen, Gebein
<nomen.Koerper> -> Hornsubstanz
<nomen.Koerper> -> Körpersubstanz
<nomen.Substanz> -> Stoff1, Substanz, Materie
<nomen.Tops> -> Objekt

In the corpus, the complement words 'Unterarmknochen' (3 times) and
'Oberarmknochen' (19 times) are found for the same keyword. Both have
no entry in GermaNet. The following information for the word
'Oberarmknochen' (similarly for the word 'Unterarmknochen') could be
inserted:

<nomen.Koerper> Oberarmknochen
<nomen.Koerper> -> Armknochen
<nomen.Koerper> -> Knochen, Gebein
<nomen.Koerper> -> Hornsubstanz
<nomen.Koerper> -> Körpersubstanz
<nomen.Substanz> -> Stoff1, Substanz, Materie
<nomen.Tops> -> Objekt

Table 3: Enhanced Hypernym Information for Complement Entries.

hypernym                                                          number of     percentage
                                                                  occurrences
<nomen.Tops> -> Objekt                                            14            16.47
<nomen.Koerper> -> Hornsubstanz                                   13            15.29
<nomen.Substanz> -> Stoff1, Substanz, Materie                     13            15.29
<nomen.Koerper> -> Körpersubstanz                                 13            15.29
<nomen.Koerper> -> Knochen, Gebein                                13            15.29
<nomen.Artefakt> -> Artefakt, Werk                                -             -
<nomen.Tops> -> Ding, Sache, Gegenstand, Gebilde                  -             -
<nomen.Menge> -> Masseinheit, Mass, Messeinheit, Messeinheit*o    -             -
<nomen.Koerper> -> Armknochen                                     2             2.35
<nomen.Artefakt> -> Computerprogramm, Programm                    -             -
<nomen.Artefakt> -> ?akustisches Gerät                            -             -

Another kind of compound in the corpus is a compound whose first part
names a body part, e.g. 'Nierenschleimhaut' (kidney mucosa) or
'Brustwirbelsäule' (thoracic spine); a body part can be a region of the
body or an organ. In this case, the following restrictions should be
considered by the method:

• both parts of the compound should have an entry in GermaNet and

• the parts of the compound should also appear in the corpus as a
complex noun phrase: the first part of the compound is the complement
and the second part of the compound is the keyword (e.g.,
'Magenschleimhaut' (stomach mucosa) vs. 'Schleimhaut des Magens'
(mucosa of the stomach)).

In these cases, information via GermaNet's meronym relation is
deduced.
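
The prefix-based compound rule from the beginning of this subsection can be sketched as follows. The table HYPERNYMS is a hand-copied excerpt of the Armknochen entry shown above and stands in for a GermaNet lookup; the function name derive_entry and the simple string handling are assumptions made for illustration.

```python
# Assumption: known word -> hypernym path (most specific hypernym first).
HYPERNYMS = {
    "Armknochen": ["Knochen, Gebein", "Hornsubstanz", "Körpersubstanz",
                   "Stoff, Substanz, Materie", "Objekt"],
}

# Typical prefixes of the domain, as listed in the text.
PREFIXES = ("Unter", "Ober", "Innen", "Aussen")

def derive_entry(word):
    """Propose a hypernym path for an unknown prefixed compound, or None."""
    for prefix in PREFIXES:
        if word.startswith(prefix):
            base = word[len(prefix):].capitalize()  # 'armknochen' -> 'Armknochen'
            if base in HYPERNYMS:
                # The known base word becomes the direct hypernym.
                return [base] + HYPERNYMS[base]
    return None

print(derive_entry("Oberarmknochen"))
print(derive_entry("Unterarmknochen"))
print(derive_entry("Schulterblatt"))
```

A proposal obtained in this way would still need to be checked, since not every word that starts with one of the prefixes is a compound of this kind.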

3.4. Disambiguation 

The basis of a correct deduction of concepts is the selection of the
correct sense from the senses available in GermaNet. In our case, the
restriction to one top level category is sufficient for this analysis
of forensic autopsy protocols, especially of the findings section. In
this section, only anatomic concepts and their findings are described.
For other domains, it is necessary to use methods for word sense
disambiguation, e.g., methods that use selectional preference (see
(Resnik, 1997), (Abney and Light, 1999)) or conceptual density (Agirre
and Rigau, 1996).

4. Related Work 

The approach exploits the specific syntactic structures of a
sublanguage. In the work of (Kokkinakis et al., 2000), the analyses of
compounds and specific syntactic structures are used for the extension
of the Swedish SIMPLE lexicon. This work exploits the advantage of the
productive compounding characteristic of Swedish to derive new lexical
items (resulting in information about semantic type, domain, and
semantic class). Furthermore, they used a raw and partially parsed
corpus for the analyses of enumerative NPs (with more than three common
nouns) for the derivation of co-hyponyms. The following heuristic is
used for an unknown noun in an enumerative NP: if at least two nouns
have the same assignment to a semantic class, then there is a strong
indication that the rest of the nouns are co-hyponyms and thus
semantically similar to the two already encoded nouns.

The usage of a lexical resource to learn new entries for the same
resource (WordNet) is described in (Navigli and Velardi, 2002). This
paper outlines an approach for the deduction of a sense of multi-word
terms that is based on the senses of the individual words of the
multi-word terms. Another similar approach, which combines corpus and
WordNet information to deliver verb synonyms for highly frequent verbs
of a domain-specific sublanguage, is described by (Xiao and Rösner,
2004). Peters (Peters, 2004) describes how new knowledge fragments can
be derived and extended from synonymy, hypernymy and thematic relations
of WordNet and implicit information from the (Euro)WordNet.

5. Conclusion 

Linguistic resources with domain-specific coverage are crucial for the
development of concrete application systems. In this paper, we proposed
an approach for the extraction of semantic information, using the
information available in GermaNet for the individual words that
frequently occur in a specific syntactic structure of the corpus.

The results of the approach can be helpful for the corpus based
semiautomatic extension of the GermaNet resources. With this approach,
it is possible to extract information about a new entry (e.g., forearm
bone) or to complete senses or hypernym information for entries
existing in GermaNet (e.g., lower leg). The results also contain
synonyms, like 'Jochbogen' (zygoma), 'Jochbeinknochen' (zygomatic
bone), and 'Jochbogen' (zygomatic), which can be detected by a deeper
context-related investigation of the elements of a complement set.

In future work, we will evaluate the approach for other syntactic
structures and investigate whether it is possible to deduce information
about the keyword of a syntactic structure when the complements are
known. Another aspect will be the exploitation of the resources of the
Medical Subject Headings (MeSH) [4]. The points to investigate are: How
many medical terms (in a more everyday language) of the forensic
autopsy protocols are covered by MeSH? What differences exist between
entries of MeSH and GermaNet? The latter question arises because Basili
et al. describe some discrepancies between entries in MeSH and WordNet
(Basili et al., 2003). Furthermore, their paper outlines the mapping of
a domain concept hierarchy (MeSH) to a lexical knowledge base (WordNet)
for the building of a linguistically motivated domain hierarchy. If
such an approach is necessary in the analysis of forensic autopsy
protocols, it should be considered in further analyses of the corpus
and in the evaluation by medical experts.